The broad goal of this project is to explore the IMDB movie dataset and address interesting questions such as which genres are produced most often, how IMDB scores are distributed, and which movies are profitable. If possible, I also want to identify which features contribute most to a highly rated or profitable film, and to try to predict a movie’s profitability.
Load the required R packages.
The dataset used in this project is the IMDB 5000 Movie Dataset from Kaggle, which can be accessed via this link: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset. This dataset records information on 5043 movies across 66 countries from 1916 to 2016. It is available as a CSV file of about 1 MB.
Each record in this dataset has 28 variables, including information such as “Title of the movie”, “Name of the movie director”, “Country the movie was produced in”, “Budget of the movie ($)”, “profitability”, and “IMDB score for the movie (out of 10)”.
Import the data and show its class, dimensions, and attribute names. There are 5043 movies recorded in this dataset, each with 28 attributes. Note that because Kaggle requires a username and password to download the dataset, I am sourcing the same data from my GitHub repository.
## [1] "tbl_df" "tbl" "data.frame"
## [1] 5043 28
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
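A minimal sketch of the import step, assuming base R's `read.csv`. The tibble output above suggests the original used `readr::read_csv`, but that is an assumption. The file below is a tiny stand-in written to a temporary path, not the real dataset or the author's repository URL.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical rows, not the real data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "movie_title,country,budget,gross",
  "Movie A,USA,1000000,2500000",
  "Movie B,Japan,2400000000,2298191"
), csv_path)

# Read the file, then inspect class, dimensions, and column names.
movies <- read.csv(csv_path, stringsAsFactors = FALSE)
class(movies)
dim(movies)
names(movies)
```

Replacing `csv_path` with the path or URL of the real CSV would reproduce the three outputs shown above, apart from the tibble classes.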
First, remove spurious characters from the movie title, genre, and plot keyword columns.
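The cleaning step could look roughly like the sketch below. Which stray characters actually occur in the raw data is an assumption here (a trailing "Â" or non-breaking space in titles, and a pipe separator in genres); the input values are made up for illustration.

```r
# Hypothetical raw values; the exact stray characters in the real data are assumed.
titles <- c("Avatar\u00a0", "Spectre \u00c2")
genres <- c("Action|Adventure|Sci-Fi", "Drama")

# Drop the stray "Â" and non-breaking space, then trim ordinary whitespace.
clean_titles <- trimws(gsub("[\u00c2\u00a0]", "", titles))

# Replace the pipe separator with a space so each genre becomes one token.
genres_new <- gsub("|", " ", genres, fixed = TRUE)

clean_titles
genres_new
```

The same `gsub` pattern would apply to the plot keyword column if it uses the same separator.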
Second, remove duplicates in the “movie_title” column, as duplicate records may skew the analysis. The 126 duplicate movies were therefore removed.
## [1] 126
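As a hedged illustration of the de-duplication step, the sketch below counts and drops repeated titles with base R's `duplicated`. The toy data frame is hypothetical, and the original may have used a different function such as `dplyr::distinct`.

```r
# Toy data with repeated titles (hypothetical values).
movies <- data.frame(
  movie_title = c("Alpha", "Beta", "Alpha", "Gamma", "Beta"),
  imdb_score  = c(7.1, 6.4, 7.1, 8.0, 6.5),
  stringsAsFactors = FALSE
)

# duplicated() flags every occurrence after the first one.
is_dup <- duplicated(movies$movie_title)
n_dup  <- sum(is_dup)

# Keep only the first occurrence of each title.
movies_unique <- movies[!is_dup, ]
n_dup
```

Note that this keeps the first record for each title even when later records differ in other columns, as "Beta" does here.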
Third, some columns contain currency values, such as “budget” and “gross”. For several countries, including South Korea, Japan, and Thailand, these values were not converted to US dollars. This could cause problems in later analysis, and taking inflation into consideration would complicate matters even further. Thus, only movies from the USA were kept for the profitability analysis.
## # A tibble: 4,998 x 4
## movie_title budget country gross
## <chr> <dbl> <chr> <int>
## 1 The Host 12215500000 South Korea 2201412
## 2 Lady Vengeance 4200000000 South Korea 211667
## 3 Fateless 2500000000 Hungary 195888
## 4 Princess Mononoke 2400000000 Japan 2298191
## 5 Steamboy 2127519898 Japan 410388
## 6 Akira 1100000000 Japan 439162
## 7 Godzilla 2000 1000000000 Japan 10037390
## 8 Kabhi Alvida Naa Kehna 700000000 India 3275443
## 9 Tango 700000000 Spain 1687311
## 10 Kites 600000000 India 1602466
## # ... with 4,988 more rows
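A minimal sketch of how the table above can be produced and how the USA subset can be taken. The first two rows reuse values from the table; "Example Film" is a hypothetical USA record added for illustration.

```r
movies <- data.frame(
  movie_title = c("Example Film", "The Host", "Princess Mononoke"),
  budget      = c(58000000, 12215500000, 2400000000),
  country     = c("USA", "South Korea", "Japan"),
  gross       = c(363000000, 2201412, 2298191),
  stringsAsFactors = FALSE
)

# Sort by budget, largest first, to surface budgets in local currency.
by_budget <- movies[order(-movies$budget), ]

# Keep only USA movies for the profitability analysis.
usa <- subset(movies, country == "USA")
by_budget
```

Budgets that exceed gross by several orders of magnitude, as for "The Host", are the sign of an unconverted currency.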
Then, I created a column “profit_flag” to indicate whether a movie is profitable, i.e., revenue (“gross”) \(>\) budget, where 1 means profitable. Because this involves both the “gross” and “budget” columns, only movies from the USA have a non-empty value in this column.
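The flag described above can be sketched as follows. The data are hypothetical, and encoding non-USA movies as `NA` is one reading of "non-empty value", not necessarily the original implementation.

```r
# Toy data (hypothetical values, in millions).
movies <- data.frame(
  country = c("USA", "USA", "Japan", "USA"),
  budget  = c(10, 50, 2400, NA),
  gross   = c(25, 30, 2, 40),
  stringsAsFactors = FALSE
)

# 1 if gross exceeds budget, 0 otherwise; NA for non-USA movies
# and for USA movies with a missing budget or gross.
movies$profit_flag <- ifelse(
  movies$country == "USA",
  as.integer(movies$gross > movies$budget),
  NA_integer_
)
movies
```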
Last, regarding records that contain missing values: in order to keep the dataset as complete as possible, I decided not to remove any rows with missing data and to deal with this issue in each individual analysis instead. For example, in the genre-wise analysis, records that have no value for the genre variables are excluded.
The columns with the most “NA” entries are “gross”, “budget”, and “aspect_ratio”.
## # A tibble: 30 x 2
## `Column Name` NA_Count
## <chr> <int>
## 1 gross 874
## 2 budget 487
## 3 aspect_ratio 327
## 4 title_year 107
## 5 director_facebook_likes 103
## 6 num_critic_for_reviews 49
## 7 actor_3_facebook_likes 23
## 8 num_user_for_reviews 21
## 9 duration 15
## 10 facenumber_in_poster 13
## # ... with 20 more rows
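A table like the one above can be built by counting `NA` values per column. This is a sketch on a hypothetical three-column data frame, using base R rather than whatever the original used.

```r
# Toy data with some missing values (hypothetical).
movies <- data.frame(
  gross    = c(1, NA, NA, 4),
  budget   = c(NA, 2, 3, 4),
  duration = c(90, 100, 110, 120)
)

# Count NAs per column and sort from most to fewest.
na_count <- sort(colSums(is.na(movies)), decreasing = TRUE)
na_table <- data.frame(
  column   = names(na_count),
  NA_Count = as.integer(na_count),
  stringsAsFactors = FALSE
)
na_table
```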
After the cleaning process, 3768 rows have no missing values.
## [1] 3768
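The count of fully observed rows can be obtained with `complete.cases`, sketched here on hypothetical data.

```r
# Toy data with some missing values (hypothetical).
movies <- data.frame(
  gross    = c(1, NA, NA, 4),
  budget   = c(NA, 2, 3, 4),
  duration = c(90, 100, 110, 120)
)

# complete.cases() is TRUE only for rows with no NA in any column.
n_complete <- sum(complete.cases(movies))
n_complete
```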
Here is the preview of the data.
The following table lists the name, type, and description of each variable in the dataset.
| Name | Type | Description |
|---|---|---|
| color | character | Colorization: Color or Black and White |
| director_name | character | Name of the director |
| num_critic_for_reviews | integer | Number of Critical Reviews |
| duration | integer | Duration of the movie in Minutes |
| director_facebook_likes | integer | Number of FB Page Likes of Director |
| actor_3_facebook_likes | integer | Number of FB Page Likes of Actor No.3 |
| actor_2_name | character | Name of Actor No.2 |
| actor_1_facebook_likes | integer | Number of FB Page Likes of Actor No.1 |
| gross | integer | Gross Earned by the movie in US Dollars |
| genres | character | Genres of the movie |
| actor_1_name | character | Name of Actor No.1 |
| movie_title | character | Title of the Movie |
| num_voted_users | integer | Number of Voted Users on IMDB |
| cast_total_facebook_likes | integer | Total FB Page Likes of the Entire Cast |
| actor_3_name | character | Name of Actor No.3 |
| facenumber_in_poster | integer | Number of the Actors Featured in the Movie Poster |
| plot_keywords | character | Keywords describing the movie plot |
| movie_imdb_link | character | IMDB Link of the Movie |
| num_user_for_reviews | integer | Number of Users who Reviewed the Movie |
| language | character | Language of the movie |
| country | character | Country where the Movie was Produced |
| content_rating | character | Content rating |
| budget | double | Budget of the movie in US Dollars |
| title_year | integer | Year |
| actor_2_facebook_likes | integer | Number of FB Page Likes of Actor No.2 |
| imdb_score | double | IMDB Score on a Scale of 1 to 10 |
| aspect_ratio | double | Aspect Ratio |
| movie_facebook_likes | integer | Number of FB Page Likes of the Film |
| genres_new | character | Edited genres |
| plot_keywords_new | character | Edited plot_keywords |
Construct the document-term matrix for genres.
## <<DocumentTermMatrix (documents: 4998, terms: 26)>>
## Non-/sparse entries: 14382/115566
## Sparsity : 89%
## Maximal term length: 11
## Weighting : term frequency (tf)
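The output above comes from a `DocumentTermMatrix` object, which is the class provided by the `tm` package. To show what such a matrix contains without depending on that package, here is a base R sketch that builds an equivalent term-frequency matrix by hand. The three genre strings are hypothetical.

```r
# Each "document" is the space-separated genre string of one movie (hypothetical).
genres_new <- c("Action Adventure Sci-Fi", "Drama", "Comedy Drama")

# Tokenise on spaces and collect the vocabulary.
tokens <- strsplit(tolower(genres_new), " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# One row per movie, one column per genre; 1 if the movie has that genre.
dtm <- t(vapply(tokens,
                function(tk) as.integer(terms %in% tk),
                integer(length(terms))))
colnames(dtm) <- terms

# Share of zero cells, comparable to the "Sparsity" line above.
sparsity <- mean(dtm == 0)
dtm
```

Most cells are zero because each movie carries only a few of the available genres, which is why the real matrix is 89% sparse.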
Use the document-term matrix to calculate the frequency of each genre.
## # A tibble: 26 x 2
## genre count
## <chr> <dbl>
## 1 drama 2571
## 2 comedy 1862
## 3 thriller 1396
## 4 action 1143
## 5 romance 1098
## 6 adventure 914
## 7 crime 883
## 8 sci-fi 611
## 9 fantasy 604
## 10 horror 556
## # ... with 16 more rows
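Genre frequencies are the column sums of the document-term matrix. The sketch below takes an equivalent shortcut in base R by tabulating the tokens directly; the input strings are hypothetical.

```r
# Space-separated genre strings, one per movie (hypothetical).
genres_new <- c("action adventure", "drama", "comedy drama", "drama thriller")

# Tabulate every genre token and sort from most to least frequent.
genre_count <- sort(table(unlist(strsplit(genres_new, " ", fixed = TRUE))),
                    decreasing = TRUE)
genre_freq <- data.frame(
  genre = names(genre_count),
  count = as.integer(genre_count),
  stringsAsFactors = FALSE
)
genre_freq
```

Because a movie can belong to several genres, the counts sum to more than the number of movies.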
Plot the distribution of genre frequencies. The top 3 movie genres are Drama, Comedy, and Thriller.
Next, we want to identify which genres tend to have higher budget, gross, and profit. As mentioned previously, this part only deals with movies from the USA because of the currency conversion issue.
Calculate the mean budget, gross, and profit (= gross - budget) for each genre.
## # A tibble: 23 x 4
## genres_new mean_gross mean_budget mean_profit
## <chr> <dbl> <dbl> <dbl>
## 1 Action 86199950. 71082359. 15117591.
## 2 Adventure 108673001. 82183258. 26489743.
## 3 Animation 121531812. 86206545. 35325266.
## 4 Biography 46089398. 29135662. 16953735.
## 5 Comedy 54927568. 35433745. 19493823.
## 6 Crime 45009068. 34262435. 10746633.
## 7 Documentary 14683626. 4021501. 10662126.
## 8 Drama 42694443. 29676037. 13018406.
## 9 Family 95835029. 66128159. 29706870.
## 10 Fantasy 91606040. 66912956. 24693085.
## # ... with 13 more rows
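One way to arrive at per-genre means is to expand each movie into one row per genre and then aggregate. This is a sketch under that assumption, with hypothetical figures in millions; the original pipeline may differ.

```r
# Toy USA movies (hypothetical values, in millions).
usa <- data.frame(
  genres_new = c("Action Adventure", "Drama", "Action Drama"),
  gross      = c(300, 40, 100),
  budget     = c(200, 30, 120),
  stringsAsFactors = FALSE
)

# Expand to one row per (movie, genre) pair.
tokens <- strsplit(usa$genres_new, " ", fixed = TRUE)
long <- data.frame(
  genre  = unlist(tokens),
  gross  = rep(usa$gross,  lengths(tokens)),
  budget = rep(usa$budget, lengths(tokens)),
  stringsAsFactors = FALSE
)
long$profit <- long$gross - long$budget

# Mean gross, budget, and profit per genre.
genre_means <- aggregate(cbind(gross, budget, profit) ~ genre,
                         data = long, FUN = mean)
genre_means
```

A multi-genre movie contributes its full gross and budget to every genre it belongs to, so the per-genre means are not independent of each other.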
Compare them using a grouped bar chart.
This part of the analysis aims to explore the nationality component of the data. First, we show the number of movies produced in each country through the years, as presented in the following heat map. Most countries started producing movies in the early 2000s, except for a handful that have had prevalent movie production since the mid 1900s. Here are some insights that I’ve found:

1. The US has the most thriving movie industry, and movies have been produced there since the early-mid nineties.
2. Japan, Italy, Germany, and France are the only other countries that produced a significant number of movies before the 1980s.

English is the language in which most movies are made, and the USA produces movies in 14 languages, the most of any country.
Here, I look at which kinds of movies are more successful in terms of IMDB ratings.
We start by looking at the central tendency (mean) and the variation of the movie scores. For this purpose, I plotted a histogram that also marks the 5th and 95th percentiles of the IMDB score.
Summary statistics of the IMDB score. The average IMDB score is 6.4, and 90% of movies have a score between 4.3 and 8.1. IMDB scores follow a roughly bell-shaped distribution, so any movie with a score above 8.1 is among the top 5% of movies in this dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 5.800 6.600 6.441 7.200 9.500
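The 5th and 95th percentile marks can be computed with `quantile`. The sketch below uses a short hypothetical score vector, so its numbers differ from those of the real data.

```r
# Hypothetical scores; the real column is movies$imdb_score.
imdb_score <- c(1.6, 4.3, 5.8, 6.4, 6.6, 7.2, 8.1, 9.5)

# Six-number summary, as in the output above.
summary(imdb_score)

# Bounds that enclose the central 90% of scores.
bounds <- quantile(imdb_score, probs = c(0.05, 0.95), na.rm = TRUE)

# Share of scores that fall inside the bounds.
share_inside <- mean(imdb_score >= bounds[1] & imdb_score <= bounds[2])
bounds
```

On the full dataset the share inside the bounds is about 90% by construction; on a sample this small it is coarser.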
IMDB score distribution.
Top 15 movies with the highest IMDB score. Because several movies tie at 8.8, the list contains 20 rows.
## # A tibble: 20 x 2
## movie_title avg_imdb
## <chr> <dbl>
## 1 "Towering Inferno " 9.5
## 2 "The Shawshank Redemption " 9.3
## 3 "The Godfather " 9.2
## 4 "Dekalog " 9.1
## 5 "Kickboxer: Vengeance " 9.1
## 6 "Fargo " 9
## 7 "The Dark Knight " 9
## 8 "The Godfather: Part II " 9
## 9 "12 Angry Men " 8.9
## 10 "Pulp Fiction " 8.9
## 11 "Schindler's List " 8.9
## 12 "The Good, the Bad and the Ugly " 8.9
## 13 "The Lord of the Rings: The Return of the King " 8.9
## 14 "Daredevil " 8.8
## 15 "Fight Club " 8.8
## 16 "Forrest Gump " 8.8
## 17 "Inception " 8.8
## 18 "It's Always Sunny in Philadelphia " 8.8
## 19 "Star Wars: Episode V - The Empire Strikes Back " 8.8
## 20 "The Lord of the Rings: The Fellowship of the Ring " 8.8
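The table has 20 rows even though 15 were requested, which is what happens when ties at the cutoff score are kept. `dplyr::top_n()` behaves this way, although whether the original used it is an assumption. A base R sketch of the same idea, on hypothetical data:

```r
# Toy ratings (hypothetical values).
ratings <- data.frame(
  movie_title = c("A", "B", "C", "D", "E"),
  imdb_score  = c(9.0, 8.8, 8.8, 8.8, 7.5),
  stringsAsFactors = FALSE
)

# The score of the n-th best movie is the cutoff.
n <- 2
cutoff <- sort(ratings$imdb_score, decreasing = TRUE)[n]

# Keeping every movie at or above the cutoff includes all ties.
top_with_ties <- ratings[ratings$imdb_score >= cutoff, ]
top_with_ties
```

Here asking for the top 2 returns 4 rows, because three movies share the cutoff score of 8.8.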
Top 15 directors with the highest average IMDB score.
## # A tibble: 22 x 2
## director_name avg_imdb
## <chr> <dbl>
## 1 John Blanchard 9.5
## 2 Cary Bell 8.7
## 3 Mitchell Altieri 8.7
## 4 Sadyk Sher-Niyaz 8.7
## 5 Charles Chaplin 8.6
## 6 Mike Mayhall 8.6
## 7 Damien Chazelle 8.5
## 8 Majid Majidi 8.5
## 9 Raja Menon 8.5
## 10 Ron Fricke 8.5
## # ... with 12 more rows
To understand the relationship between IMDB score, profit, and budget, I first drew a 3D scatter plot with the popular visualization package “plotly” to get the big picture. The plot is interactive, so the relationship can be examined from different angles. As previously discussed, this analysis only includes movies from the USA.
From the plot, we can see that movies with a higher IMDB score tend to have a higher profit, and that a significant number of movies end up losing money. Intuitively, IMDB score and gross may be correlated, since people prefer to watch famous and highly rated movies.
Commercial success vs. critical acclaim for movies from the USA.
These are the top 15 movies with the highest profit, shown along with their profit and Return on Investment (ROI). Note that the bigger a point is, the higher its ROI. For movies with a budget over 70 million dollars, we can observe an upward, close-to-linear trend, which suggests that bigger-budget movies tend to earn more profit. However, there is a downward trend when the budget is less than 70 million dollars. Looking more closely at the movies in this region, I found that most of them were produced in the 80s or early 90s, so their true budget would be higher once inflation is taken into consideration.
Nonetheless, profit alone does not give the whole picture of a movie’s monetary success across the years, so Return on Investment is perhaps more suitable for describing a movie’s profitability. Thus, here are the top 15 movies with the highest Return on Investment among movies with a budget of at least 10 million dollars.
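The ROI ranking can be sketched as below. The formula, ROI = profit / budget × 100, is an assumption; it is consistent with the director table that follows, where a profit of 305 on a budget of 58 corresponds to an ROI of about 526. The films and figures in the code are hypothetical.

```r
# Toy USA movies (hypothetical titles and values, in US dollars).
usa <- data.frame(
  movie_title = c("Film A", "Film B", "Film C"),
  budget      = c(58e6, 5e6, 150e6),
  gross       = c(363e6, 50e6, 400e6),
  stringsAsFactors = FALSE
)

# Profit and ROI in percent (assumed definition: profit / budget * 100).
usa$profit <- usa$gross - usa$budget
usa$roi    <- usa$profit / usa$budget * 100

# Keep movies with a budget of at least 10 million, then rank by ROI.
eligible <- usa[usa$budget >= 10e6, ]
top_roi  <- eligible[order(-eligible$roi), ]
top_roi
```

The budget floor removes very cheap films such as "Film B", whose ROI of 900% would otherwise dominate the ranking.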
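Top 15 directors with the highest average gross earned, profit, and Return on Investment. The table below is sorted by average profit and also lists each director’s average budget and average ROI. (R Core Team 2019)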
## # A tibble: 15 x 4
## director_name avg_profit avg_budget avg_ROI
## <chr> <dbl> <dbl> <dbl>
## 1 Tim Miller 305. 58 526.
## 2 George Lucas 277. 71.0 3902.
## 3 Richard Marquand 277. 32.5 851.
## 4 Irvin Kershner 272. 18 1512.
## 5 Kyle Balda 262. 74 354.
## 6 Colin Trevorrow 253. 75.4 385.
## 7 Chris Buck 251. 150 167.
## 8 Pierre Coffin 237. 72.5 324.
## 9 Lee Unkrich 215. 200 107.
## 10 Joss Whedon 199. 170 76.7
## 11 James Cameron 195. 124. 153.
## 12 Roger Allers 189. 65 419.
## 13 William Cottrell 183. 2 9146.
## 14 Pete Docter 158. 155 108.
## 15 Francis Lawrence 151. 121. 120.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.